class: center, middle, inverse, title-slide .title[ # Lecture 7: Programming ] .author[ ### James Sears*
AFRE 891 SS 24
Michigan State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Data Science for Economists”
by Grant McDermott and
“Advanced Data Analytics”
by Nick Hagerty.] ]

---

<style type="text/css">
/* CSS for including pauses in printed PDF output (see bottom of lecture) */
@media print {
  .has-continuation {
    display: block !important;
  }
}
.remark-code-line {
  font-size: 95%;
}
.small {
  font-size: 75%;
}
.scroll-output-full {
  height: 90%;
  overflow-y: scroll;
}
.scroll-output-75 {
  height: 75%;
  overflow-y: scroll;
}
</style>

# Table of Contents

1. [Prologue](#prologue)
2. [If/Else Statements](#ifelse)
3. [For Loops](#for)
4. [Functions](#fun)
5. [Indirection and Name Injection](#indirect)
6. [Vectorization](#vec)
7. [Parallelization](#parallel)

---
class: inverse, middle
name: prologue

# Prologue

---
# Programming

So far in class we've learned how to do a lot of things in R, but we can exponentially increase our data analytics skills (and how quickly we get things done) by learning some .hi-blue[programming].

--

* Write custom functions to execute specific tasks
  * Scrape all Yellowpages business links for a given search term in hundreds of different cities
* Conditionally define variables or execute different tasks
  * Create a variable conditional on another variable's values
* Perform a repeated task by looping over values
  * Create a set of state-level dummy variables from state FIPS codes
* Run tasks efficiently in parallel
  * Calculate parcel or farm-level measures of precipitation and temperature

---
# Programming

Packages we'll use today:

```r
pacman::p_load(dslabs, tidyverse, furrr, tictoc, future, progressr)
```

--

And let's load in the `murders` data from the `dslabs` package:

```r
data(murders)
```

---
class: inverse, middle
name: ifelse

# If/Else Statements

---
# If/Else Statements

If/else statements are a type of .hi-medgrn[conditional expression].

* Check to see if a logical condition is True
* If True, do a thing
* If False:
  * Do a different thing,
  * Do nothing, or
  * Check *another* condition, do a thing if True, etc.

---
# If/Else Statements

For example: print the reciprocal of `a`, unless `a` is 0.
```r
a = 0
if(a != 0) {
  print(1 / a)
} else {
  print("Reciprocal does not exist.")
}
```

```
## [1] "Reciprocal does not exist."
```

--

Statements like this are used for .hi-blue[control flow] of your code.

* Used all the time in software development
* Used occasionally in data analysis, more often in custom functions and packages.

---
# If/Else Statements

You can also link together multiple conditions with `else if`s.

```r
if(a > 0) {
  print("a is Positive")
} else if (a < 0){
  print("a is Negative")
} else {
  print("a is Zero")
}
```

```
## [1] "a is Zero"
```

---
# If/Else Statements

A related function that you *will* use all the time in data analysis: `ifelse`.

--

.center[
syntax: `ifelse(CONDITION, ACTION_IF_TRUE, ACTION_IF_FALSE)`
]

* `CONDITION`: a logical condition
* `ACTION_IF_TRUE`: what to do if the condition is true
* `ACTION_IF_FALSE`: what to do if the condition is false

--

For example:

```r
a = 0
ifelse(a > 0, 1/a, NA)
```

```
## [1] NA
```

---
# If/Else Statements

.center[
syntax: `ifelse(CONDITION, ACTION_IF_TRUE, ACTION_IF_FALSE)`
]

`ifelse` is particularly useful because it is .hi-purple[vectorized] and can be applied over .hi-purple[a vector of elements all at once].

--

For example, to change negative numbers to missing:

```r
b = c(0, 1, 2, -3, 4)
ifelse(b < 0, NA, b)
```

```
## [1]  0  1  2 NA  4
```

---
# If/Else Statements

.center[
syntax: `ifelse(CONDITION, ACTION_IF_TRUE, ACTION_IF_FALSE)`
]

`ifelse` is particularly useful because it is .hi-purple[vectorized] and can be applied over .hi-purple[a vector of elements all at once].

It also works for adding a conditional variable, for example whether or not a state is Michigan:

```r
murders <- murders %>%
  mutate(
*   is_michigan = ifelse(state == "Michigan", "Is Michigan", "Is Not Michigan")
  )
murders[c(1, 23:26),]
```

```
##          state abb        region population total     is_michigan
## 1      Alabama  AL         South    4779736   135 Is Not Michigan
## 23    Michigan  MI North Central    9883640   413     Is Michigan
## 24   Minnesota  MN North Central    5303925    53
Is Not Michigan ## 25 Mississippi MS South 2967297 120 Is Not Michigan ## 26 Missouri MO North Central 5988927 321 Is Not Michigan ``` --- # case_when() While it's technically possible to use nested ifelses, friends don't let friends nest ifelses. -- Instead, use .hi-slate[dplyr's] `case_when()` ```r x <- 1:10 ## dplyr::case_when() case_when( x <= 3 ~ "small", x <= 7 ~ "medium", TRUE ~ "big" # Default value ) ``` ``` ## [1] "small" "small" "small" "medium" "medium" "medium" "medium" "big" ## [9] "big" "big" ``` --- # case_when() Works great within `mutate()` as well! ```r murders <- murders %>% mutate( my_opinion = case_when( state == "Michigan" ~ "Great State", state %in% c("California", "Hawaii") ~ "Also Solid State", state == "Missouri" ~ "More like Misery am I right", TRUE ~ "A State") ) murders[c(1, 5, 12, 23, 26, 38),c(1,7)] ``` ``` ## state my_opinion ## 1 Alabama A State ## 5 California Also Solid State ## 12 Hawaii Also Solid State ## 23 Michigan Great State ## 26 Missouri More like Misery am I right ## 38 Oregon A State ``` --- class: inverse, middle name: for # For Loops --- # Abstraction Often you will have tasks where you find yourself copying and pasting your code to do the same thing `\(n\)` times, with only minor tweaks each time. Q: What's wrong with that? -- A: It's: * Annoying (especially if `\(n\)` is large) * Hard to change later if needed * Prone to errors/bugs Instead, you can .hi-medgrn[abstract] your code: define it once, and run it multiple times. The rest of this lecture covers tools for abstraction in different situations. A good rule to aim for is to .hi-blue[never copy-and-paste more than twice.] If you're pasting more than that, abstract it instead! --- # Abstraction Methods There are several different methods for code abstraction that we'll go over: 1. .hi-blue[For loops:] when you want to repeat the same code for .hi-blue[different values of a variable or vector] 1. 
.hi-medgrn[Functions:] when you want to repeat the same code for potentially .hi-purple[different values of all arguments/variables] or with .hi-medgrn[different settings/samples]
1. .hi-purple[Vectorization and Functionals:] when you want to .hi-purple[repeat a function over different values of arguments]

---
# For Loops

The .hi-medgrn[for loop] is a simple tool for .hi-medgrn[iteration]

```
for (INDEX in RANGE){
  action(INDEX)
}
```

* `INDEX`: the name of the index you want to use (often `i` but can be anything)
* `RANGE`: the vector of values to iterate over (can be numbers, characters, or objects)

---
# For Loops

The .hi-medgrn[for loop] is a simple tool for .hi-medgrn[iteration]

```r
for (i in 1:6){
  print(paste0("It is ", i, " O'Clock."))
}
```

```
## [1] "It is 1 O'Clock."
## [1] "It is 2 O'Clock."
## [1] "It is 3 O'Clock."
## [1] "It is 4 O'Clock."
## [1] "It is 5 O'Clock."
## [1] "It is 6 O'Clock."
```

---
# For Loops

You can also combine for loops with if-else:

```r
for (i in c("Indiana", "Michigan", "Colorado")){
  if (i == "Michigan"){
    print("This is Michigan")
  } else {
    print("This is not Michigan")
  }
}
```

```
## [1] "This is not Michigan"
## [1] "This is Michigan"
## [1] "This is not Michigan"
```

---
# For Loops

Suppose you wanted to calculate the mean of each numeric variable in `murders`, including the murder rate.
We could manually type and copy-paste:

```r
murders <- mutate(murders, rate = total/population * 1e5)
mean(murders$total)
```

```
## [1] 184.3725
```

```r
mean(murders$population)
```

```
## [1] 6075769
```

```r
mean(murders$rate)
```

```
## [1] 2.779125
```

---
# For Loops

Or we could avoid copy-paste errors and use a for loop:

```r
for (var in c("total", "population", "rate")){
  print(mean(murders[[var]]))
}
```

```
## [1] 184.3725
## [1] 6075769
## [1] 2.779125
```

---
# For Loops

We can also loop over an .hi-blue[object in memory:]

```r
numeric_col <- c("total", "population", "rate")
for (var in numeric_col){
  print(mean(murders[[var]]))
}
```

```
## [1] 184.3725
## [1] 6075769
## [1] 2.779125
```

---
# For Loops

Or .hi-medgrn[assign output to memory] too

```r
numeric_col <- c("total", "population", "rate")
means <- vector() # initialize an empty vector
for (var in numeric_col){
  means[[var]] <- mean(murders[[var]])
}
```

--

<br>

There is one technical problem with this code. The vector storing the output .hi-purple["grows" at each iteration], which can make the loop .hi-purple[very slow].

---
# For Loops

.hi-blue[Better:] give your empty vector the .hi-blue[right length] *before* starting.

```r
means <- vector("numeric", length = length(numeric_col)) # pre-allocate a vector of the right length
for (i in seq_along(numeric_col)){
  col_num <- which(colnames(murders) == numeric_col[i])
  means[[i]] <- mean(murders[[col_num]])
}
```

---
# For Loops: Caveat

For-loops are actually .hi-blue[discouraged in R programming].

* We're covering them because the concepts are foundational.
* But R has nicer ways to iterate, called .hi-medgrn[vectorization].
* To do proper vectorization, we first need to know how to .hi-purple[write functions].

---
class: inverse, middle
name: fun

# Functions

---
# Functions

We've already seen a .hi-medgrn[multitude of functions] in R

* pre-packaged with base R
* loaded by different packages (e.g.
`dplyr::mutate()`)

Regardless of where they come from, they all follow the same basic syntax:

.hi-center[
`function_name(ARGUMENTS)`
]

---
# Custom Functions

While we will often use pre-made functions, you can --- and should! --- write your own functions too. This is easy to do with the generic **`function()`** function.<sup>1</sup>

If you only have a short function, you can write it all on a .hi-medgrn[single line:]

```r
function(ARGUMENTS) OPERATIONS
```

.footnote[<sup>1.</sup> Yes, it's a function that lets you write functions. Very meta.]

---
# Custom Functions

Oftentimes we want our function code to span .hi-pink[multiple lines]. In this case we can use curly braces:

```r
function(ARGUMENTS) {
  OPERATIONS
  return(VALUE)
}
```

.footnote[<sup>1.</sup> Yes, it's a function that lets you write functions. Very meta.]

---
# Custom Functions

Rather than write .hi-medgrn[anonymous] functions, we can .hi-blue[name our functions] to assign them to memory and reuse them throughout our file:

```r
my_func <- function(ARGUMENTS) {
  OPERATIONS
  return(VALUE)
}
```

Try to give your functions short, pithy names that are

* Informative to you
* Clear to anyone else who might read the code

---
# Building Custom Functions

Let's start with a basic function: calculate a .hi-medgrn[number's square].<sup>2</sup>

```r
square <-        # function name
  function(x){   # the arguments of our function (here just one)
    x^2          # the operation(s) that our function performs
  }
```

--

Testing:

```r
square(4)
```

```
## [1] 16
```

.footnote[<sup>2</sup> I want to note that this .hi-blue[isn't a useful function]. R's arithmetic functions already handle vectorised exponentiation and do so very efficiently.]
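---
# Building Custom Functions

The same pattern extends to functions with more than one argument. As a minimal sketch, here is the murder-rate calculation from earlier wrapped in a named function. The name `rate_per_100k()` and the toy inputs are made up for illustration and are not part of the lecture's own code.

```r
# Hypothetical helper (not from the slides): events per 100,000 people
rate_per_100k <- function(total, population){
  total / population * 1e5
}

# Toy inputs, made up for illustration (not taken from `murders`)
toy_total <- c(10, 25)
toy_pop   <- c(1e5, 5e5)

# R arithmetic is vectorized, so the function handles whole vectors at once
toy_rate <- rate_per_100k(toy_total, toy_pop)
toy_rate
```

Because the body is plain arithmetic, the same function should also work inside `mutate()`, e.g. `mutate(murders, rate = rate_per_100k(total, population))`.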
---
# Specifying Return Values

We can .hi-blue[specify return values] with `return()`

* Helpful when our function performs a bunch of intermediate steps

```r
square <- function(x){
* x_sq <- x^2 # assign squared value as intermediate object
* return(x_sq)
}
```

---
# Specifying Return Values

Testing:

```r
square(3)
```

```
## [1] 9
```

Note that the intermediate objects .hi-green[don't stay in memory] - they're automatically removed as soon as the function is done running.

--

If we leave out the `return()`, the function will return .hi-purple[the result of the very last operation]

---
# Specifying Return Values

If we want to return .hi-purple[multiple objects] from our function, we need to either

.hi-medgrn[1\. Use a List]

```r
square_list <- function(x){
  x_sq <- x^2 # assign squared value as intermediate object
* res <- list(value = x, val_squared = x_sq)
  return(res)
}
square_list(3)
```

```
## $value
## [1] 3
## 
## $val_squared
## [1] 9
```

---
# Specifying Return Values

If we want to return .hi-purple[multiple objects] from our function, we need to either

.hi-blue[2\. Build a data frame] (a tidy solution!)
```r
square_df <- function(x){
  x_sq <- x^2 # assign squared value as intermediate object
* res <- data.frame(value = x, val_squared = x_sq)
  return(res)
}
square_df(3)
```

```
##   value val_squared
## 1     3           9
```

---
# Default Argument Values

We can also assign .hi-medgrn[default argument values]

* Allows for all/any arguments to be optional
* Use the supplied value when supplied
* Use default value when not

--

Suppose we wanted to expand our function to do any exponent and not just squares:

```r
raise_power <- function(x = 2, power = 2){
  res <- data.frame(
    value = x,
    power = power,
    value_raised = x^power
  )
  return(res)
}
```

---
# Default Argument Values

Setting default values doesn't affect typical function usage:

```r
raise_power(x = 5, power = 3) # uses specified values
```

```
##   value power value_raised
## 1     5     3          125
```

--

But now any argument that we omit will .hi-green[use the default values] and the function will run:

```r
raise_power() # uses the defaults x = 2 and power = 2
```

```
##   value power value_raised
## 1     2     2            4
```

---
# Default Argument Values

Setting default values doesn't affect typical function usage:

```r
raise_power(x = 5, power = 3) # uses specified values
```

```
##   value power value_raised
## 1     5     3          125
```

Without supplying argument values, our previous function wouldn't have worked:

```r
square()
```

```
## Error in square(): argument "x" is missing, with no default
```

---
class: inverse, middle
name: indirect

# Indirection and Name Injection

---
# Indirection

A common use-case for custom functions is .hi-purple[iterating over variables]

* Repeat a cleaning task over multiple variables in a data frame
* Run analysis with a different dependent variable

--

For example, let's go back to our `square` function.
By default it applies over an entire vector: ```r square(murders$rate) ``` ``` ## [1] 7.9773697 7.1566199 13.1734682 10.1722092 11.3848093 1.6704350 ## [7] 7.3656453 17.9092897 270.6930871 11.5468718 14.3665453 0.2648049 ## [13] 0.5860059 8.0483466 4.7964199 0.4752012 4.8757523 7.1460033 ## [19] 59.9475608 0.6857300 25.7542601 3.2478494 17.4608856 0.9985205 ## [25] 16.3546200 28.7284388 1.4709757 3.0699848 9.6750631 0.1442507 ## [31] 7.8289826 10.5867194 7.1180103 8.9959426 0.3536860 7.2206276 ## [37] 8.7552904 0.8830065 12.9438141 2.3106835 20.0285200 0.9654707 ## [43] 11.9089569 10.2487076 0.6335858 0.1021576 9.7631255 1.9126730 ## [49] 2.1231443 2.9092373 0.7869696 ``` --- # Indirection We could use it *within* a mutate if we want a new column in our data frame: ```r murders <- murders %>% mutate(rate_sq = square(rate)) select(murders, starts_with("rate")) %>% head() ``` ``` ## rate rate_sq ## 1 2.824424 7.977370 ## 2 2.675186 7.156620 ## 3 3.629527 13.173468 ## 4 3.189390 10.172209 ## 5 3.374138 11.384809 ## 6 1.292453 1.670435 ``` But doing this for a lot of variables would require a lot of typing (and wouldn't vectorize over multiple variables well) --- # Indirection What we might want to do is modify our function to use .hi-purple[variable names and the dataframe] as the arguments to directly add a new variable: ```r square_df <- function(var, # variable to square df){ # data frame to square variables in df <- mutate(df, newvar = var * var) return(df) } ``` --- # Indirection However, if we try and use this function on the `rate` variable in `murders` with a string, we get an error: ```r square_df( var = "rate", df = murders) ``` ``` ## Error in `mutate()`: ## ℹ In argument: `newvar = var * var`. ## Caused by error in `var * var`: ## ! 
non-numeric argument to binary operator
```

---
# Indirection

We get a similar error if we give the variable argument as a .hi-blue[data-variable]

* .hi-blue[data-variable]: a "statistical" variable that lives .hi-blue[in a data frame]

```r
square_df( var = rate, df = murders)
```

```
## Error in `mutate()`:
## ℹ In argument: `newvar = var * var`.
## Caused by error:
## ! object 'rate' not found
```

---
# Indirection

This is an issue of .hi-purple[indirection], which occurs in cases like this:

- Inside the function, `var` is an .hi-medgrn[environment-variable], but we want `mutate()` to treat what it holds as a .hi-blue[data-variable].
- .hi-medgrn[env-variable]: "programming" variable/object that lives in your environment (e.g. a data frame created with `<-`)

<br>

Fortunately, there are a couple of programmatic ways around this.

---
# Indirection

.hi-purple[Solution A:] provide the argument as a .hi-blue[data-variable], and

1. .hi-slate[defuse] the argument with `enquo()`
1. .hi-dkgrn[unquote] the defused argument in operations with `!!defused_arg`

```r
square_def <- function(var, # data-var rather than a string
                       df){
* var <- enquo(var) # defuse the argument
  df <- mutate(df,
*   newvar = !!var * !!var # square the defused argument
  )
  return(df)
}
square_def(rate, murders) %>% select(rate, newvar) %>% head()
```

```
##       rate    newvar
## 1 2.824424  7.977370
## 2 2.675186  7.156620
## 3 3.629527 13.173468
## 4 3.189390 10.172209
## 5 3.374138 11.384809
## 6 1.292453  1.670435
```

---
# Indirection

.hi-purple[Solution B:] provide the argument as a .hi-blue[data variable], and within function operations .hi-purple[embrace] the argument with double braces `{{ var }}`

```r
square_embr <- function(var, # data-var rather than a string
                        df){
  df <- mutate(df,
*   newvar = {{ var }} * {{ var }}
  )
  return(df)
}
square_embr(rate, murders) %>% select(rate, newvar) %>% head()
```

```
##       rate    newvar
## 1 2.824424  7.977370
## 2 2.675186  7.156620
## 3 3.629527 13.173468
## 4 3.189390 10.172209
## 5 3.374138 11.384809
## 6 1.292453
1.670435 ``` --- # Indirection .hi-purple[Solution C:] defuse the string with `ensym()` * Allows for supplying the argument as either a .hi-pink[character string] or a .hi-blue[data variable] ```r square_ensym <- function(var, df){ df <- mutate(df, * newvar = !!ensym(var) * !!ensym(var) # square the defused string ) return(df) } square_ensym("rate", murders) %>% select(rate, newvar) %>% head(3) ``` ``` ## rate newvar ## 1 2.824424 7.97737 ## 2 2.675186 7.15662 ## 3 3.629527 13.17347 ``` ```r square_ensym(rate, murders) %>% select(rate, newvar) %>% head(3) ``` ``` ## rate newvar ## 1 2.824424 7.97737 ## 2 2.675186 7.15662 ## 3 3.629527 13.17347 ``` --- # Name Injection We can combine defusing or embracing with .hi-medgrn[name injection] to customize our variable names. * i.e. call the new squared rate variable `rate_sq` rather than `newvar` Often we want to programmatically create new variable names based either on 1. A supplied character string as a function argument, or 1. Iterating on the data-variable's name directly in the function --- # Name Injection .hi-slate[Approach 1:] use .hi-green[glue syntax] and .hi-blue[supply the new name as a third argument]: * `newname` the new variable name as a character string * Glue syntax with `"{newname}"` * Programmatic assignment operator `:=` instead of `=` ```r square_inj_1 <- function(var, df, * newname){ # new variable name to use df <- mutate(df, * "{newname}" := {{ var }} * {{ var }} ) return(df) } square_inj_1(rate, murders, "rate_sq") %>% select(rate, rate_sq) %>% head() ``` ``` ## rate rate_sq ## 1 2.824424 7.977370 ## 2 2.675186 7.156620 ## 3 3.629527 13.173468 ## 4 3.189390 10.172209 ## 5 3.374138 11.384809 ## 6 1.292453 1.670435 ``` --- # Name Injection .hi-slate[Approach 1] works with `ensym()` too ```r square_inj_1b <- function(var, df, * newname){ # new variable name to use df <- mutate(df, * "{newname}" := !!ensym(var) * !!ensym(var) ) return(df) } square_inj_1b("rate", murders, "rate_squared") %>% 
select(rate, rate_squared) %>% head()
```

```
##       rate rate_squared
## 1 2.824424     7.977370
## 2 2.675186     7.156620
## 3 3.629527    13.173468
## 4 3.189390    10.172209
## 5 3.374138    11.384809
## 6 1.292453     1.670435
```

---
# Name Injection

.hi-slate[Approach 2A:] use .hi-green[glue syntax] and .hi-purple[create the name from the data-variable]:

* `enquo()` "defuses" the supplied argument
* `rlang::as_label()` converts the defused data-variable (e.g. `rate`) to a character string
* Glue syntax with `"{new_var}"`
* Programmatic assignment operator `:=` instead of `=`

```r
square_inj_2a <- function(var, df){
  new_var <- paste0(rlang::as_label(enquo(var)), "_sq") # create new variable name internally
  df <- mutate(df,
    "{new_var}" := {{ var }} * {{ var }}
  ) # glue syntax to assign new name
  return(df)
}
square_inj_2a(rate, murders) %>% select(rate, rate_sq) %>% head()
```

```
##       rate   rate_sq
## 1 2.824424  7.977370
## 2 2.675186  7.156620
## 3 3.629527 13.173468
## 4 3.189390 10.172209
## 5 3.374138 11.384809
## 6 1.292453  1.670435
```

---
# Name Injection

.hi-slate[Approach 2B:] use .hi-green[glue syntax] and .hi-pink[embracing]:

* Glue syntax with `"{{ var }}_sq"` (no intermediate name object)
* Programmatic assignment operator `:=` instead of `=`

```r
square_inj_2b <- function(var, df){
  df <- mutate(df,
    "{{ var }}_sq" := {{ var }} * {{ var }}
  ) # Glue syntax without intermediate object
  return(df)
}
square_inj_2b(rate, murders) %>% select(rate, rate_sq) %>% head()
```

```
##       rate   rate_sq
## 1 2.824424  7.977370
## 2 2.675186  7.156620
## 3 3.629527 13.173468
## 4 3.189390 10.172209
## 5 3.374138 11.384809
## 6 1.292453  1.670435
```

---
# Name Injection

.hi-slate[Approach 2C:] you guessed it, `ensym()` still works

```r
square_inj_2c <- function(var, df){
  df <- mutate(df,
    "{{ var }}_sq" := !!ensym(var) * !!ensym(var)
  ) # Glue syntax without intermediate object
  return(df)
}
square_inj_2c("rate", murders) %>% select(rate, rate_sq) %>% head()
```

```
##       rate   rate_sq
## 1 2.824424  7.977370
## 2 2.675186  7.156620
## 3 3.629527 13.173468
## 4 3.189390 10.172209
## 5 3.374138 11.384809
## 6 1.292453  1.670435
```

---
class: inverse, middle
name: vec

# Vectorization

---
# Vectorization

The real benefits of custom functions, indirection, and name injection come with .hi-medgrn[vectorization] and .hi-purple[functionals].

These approaches give a new way to repeatedly iterate a function over a vector of argument values.

<br>

--

Two main approaches:

1. .hi-pink[apply] family
  - `apply()`, `lapply()`, `sapply()`, `mapply()`
1. Tidy .hi-blue[map] .hi-green[list] functions in .hi-slate[purrr]
  - `map()` and `map2()` with `list_c()`, `list_rbind()`, `list_cbind()`
  - Recently superseded the `map_dfr()`, `map_dfc()` functions

---
# apply Family

The base R .hi-pink[apply] family gives methods for iterating a function over a vector of arguments depending on the format and type of output we want

| Function | Description | Output Type |
|------------------|-------------------------------------|-------------|
| `lapply(X, FUN)` | apply `FUN` to every element of `X` | list |
| `sapply(X, FUN)` | apply `FUN` to every element of `X` | vector, matrix, or array |
| `vapply(X, FUN, FUN.VALUE)` | `sapply` with specified output types | vector or array |
| `mapply(FUN, ARG1, ARG2, ...)` | multivariate version of `sapply` | vector, matrix, or list |
| `apply(X, MARGIN, FUN)` | apply `FUN` to every element of `X` over dimension `MARGIN` | vector, matrix, array, or list |

---
# Apply

Suppose you wanted to standardize all the numeric variables in the `murders` data. You might write a function like this:

```r
calculate_z = function(x) {
  z = (x - mean(x)) / sd(x)
  return(z)
}
```

---
# apply Functions

However, applying it over all the numeric variables at once leads to this:

```r
numeric_cols = c("total", "population", "rate")
murder_numbers = murders[numeric_cols]
calculate_z(murder_numbers)
```

```
## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'
```

This is an example of a function that .hi-slate[isn't vectorized] over the columns of a data frame.
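---
# apply Functions

One way around this is to call the function on one column at a time. A minimal sketch of that idea with a for loop, assuming a small made-up data frame (`toy` and its values are illustrative only) and re-defining `calculate_z()` so the block runs on its own:

```r
# Same z-score function as in the slides
calculate_z <- function(x) {
  (x - mean(x)) / sd(x)
}

# Made-up data for illustration (not the murders data)
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# Pre-allocate a named list, then standardize each column in turn
z_list <- vector("list", length = ncol(toy))
names(z_list) <- colnames(toy)
for (col in colnames(toy)) {
  z_list[[col]] <- calculate_z(toy[[col]])
}

# Collect the standardized columns back into a data frame
z_df <- as.data.frame(z_list)
```

Each column of `z_df` should now have mean 0 and standard deviation 1. The loop works, but it takes several lines of bookkeeping for what is conceptually a single step.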
--- # apply Functions While we could put our function into a for loop, a more efficient/legible approach would use `sapply`<sup>3</sup>: .center[ `sapply(X, FUN)` ] ```r sapply(murder_numbers, calculate_z) %>% head() ``` ``` ## total population rate ## [1,] -0.2090939 -0.18890769 0.01844305 ## [2,] -0.7003568 -0.78207215 -0.04231860 ## [3,] 0.2017034 0.04609577 0.34623812 ## [4,] -0.3869650 -0.46057478 0.16703781 ## [5,] 4.5426034 4.54448196 0.24225740 ## [6,] -0.5055457 -0.15254681 -0.60529343 ``` .footnote[<sup>3.</sup> `sapply` is an example of a .hi-dkgrn[functional:] a function that takes another function as an argument.] --- # map and list_ Functions The tidy alternative to the apply functions are the `map_` family in .hi-slate[purrr] * Work a lot like the `apply_` functions, but with tidyverse syntax * Combine with `list_` functions to convert to a vector or dataframe | Function | Description | Output Type | |------------------|-------------------------------------|-------------| | `map(X, FUN)` |apply `FUN` to every element of `X` | list | | `map2(X1, X2, FUN)` | apply `FUN` to every element of `X1` and `X2` | list | | `list_c()` | combine list elements into a vector | vector | | `list_rbind()` | combines elements into a data frame row-wise | data frame | | `list_cbind()` | combines elements into a data frame column-wise | data frame | --- # map() Just like with `sapply()` we can iterate our `calculate_z()` over all numeric variables: .center[ `map(X, FUN)` ] ```r z_map <- map(murder_numbers, calculate_z) class(z_map) ``` ``` ## [1] "list" ``` ```r z_map ``` ``` ## $total ## [1] -0.20909395 -0.70035678 0.20170342 -0.38696497 4.54260341 -0.50554566 ## [7] -0.37002488 -0.61989131 -0.36155483 2.05240907 0.81154693 -0.75117707 ## [13] -0.73000195 0.76072663 -0.17944878 -0.69188673 -0.51401570 -0.28955941 ## [19] 0.70567132 -0.73423697 0.46003990 -0.28108936 0.96824283 -0.55636595 ## [25] -0.27261931 0.57862059 -0.73000195 -0.64530146 -0.42508019 -0.75964712 ## 
[31] 0.26099376 -0.49707561 1.40868537 0.43039473 -0.76388214 0.53203532 ## [37] -0.31073453 -0.62836136 1.15458390 -0.71306185 0.09582781 -0.74694205 ## [43] 0.14664810 2.62837239 -0.68765170 -0.77235219 0.27793386 -0.38696497 ## [49] -0.66647658 -0.37002488 -0.75964712 ## ## $population ## [1] -0.18890769 -0.78207215 0.04609577 -0.46057478 4.54448196 -0.15254681 ## [7] -0.36463968 -0.75471286 -0.79788810 1.98404612 0.56032885 -0.68731900 ## [13] -0.65710605 0.98457785 0.05947420 -0.44156250 -0.46972841 -0.25309517 ## [19] -0.22481731 -0.69197452 -0.04405069 0.06877752 0.55502908 -0.11250276 ## [25] -0.45308586 -0.01265797 -0.74137874 -0.61938976 -0.49196633 -0.69370773 ## [31] 0.39589795 -0.58545167 1.93892646 0.50428228 -0.78755849 0.79594785 ## [37] -0.33880342 -0.32718313 0.96588390 -0.73217380 -0.21140871 -0.76692071 ## [43] 0.03940371 2.77958194 -0.48273487 -0.79438727 0.28062202 0.09456378 ## [49] -0.61550486 -0.05666841 -0.80344105 ## ## $rate ## [1] 0.01844305 -0.04231860 0.34623812 0.16703781 0.24225740 -0.60529343 ## [7] -0.02652691 0.59150707 5.56716958 0.25200063 0.41170537 -0.92199686 ## [13] -0.81983639 0.02354746 -0.23983062 -0.85084539 -0.23248671 -0.04312679 ## [19] 2.02085353 -0.79435800 0.93470252 -0.39776029 0.56980030 -0.72466598 ## [25] 0.51502367 1.05074992 -0.63770849 -0.41813467 0.13490835 -0.97687549 ## [31] 0.00769770 0.19323111 -0.04526064 0.08965294 -0.88937503 -0.03745866 ## [37] 0.07320842 -0.74892202 0.33330063 -0.51261054 0.69060107 -0.73145567 ## [43] 0.27352517 0.17191143 -0.80743030 -1.00137859 0.14065877 -0.56842993 ## [49] -0.53825749 -0.43706231 -0.77032620 ``` --- # list_ The `list_` functions provide a convenient way to convert `map()` output directly to a dataframe: * Loop our `square_inj_2c()` function over all three numeric variables * Combine each of the dataframes ```r map_sq <- map( c("total", "population", "rate"), # first argument: variable names square_inj_2c, # function to iterate over df = murders # additional 
static arguments ) %>% list_cbind(name_repair = "unique") # account for duplicated names class(map_sq) ``` ``` ## [1] "data.frame" ``` ```r colnames(map_sq) ``` ``` ## [1] "state...1" "abb...2" "region...3" ## [4] "population...4" "total...5" "is_michigan...6" ## [7] "my_opinion...7" "rate...8" "rate_sq...9" ## [10] "\"total\"_sq" "state...11" "abb...12" ## [13] "region...13" "population...14" "total...15" ## [16] "is_michigan...16" "my_opinion...17" "rate...18" ## [19] "rate_sq...19" "\"population\"_sq" "state...21" ## [22] "abb...22" "region...23" "population...24" ## [25] "total...25" "is_michigan...26" "my_opinion...27" ## [28] "rate...28" "rate_sq...29" "\"rate\"_sq" ``` --- class: inverse, middle name: parallel # Parallelization --- # Parallelization One distinct advantage of R over Stata is the ability to .hi-medgrn[run code in parallel] * i.e. split a repeated task across multiple CPU cores simultaneously * Useful in any situation where we would use `map()` - i.e. bootstrapping, extracting parcel-level raster information -- .pull-left[ .center.hi-medgrn[Stata] * SE: runs in "serial" on one core * MP Student: 4 core ($375/yr) * MP 8 Core: $655/yr ] -- .pull-right[ .center[.hi-blue[R] and .hi-slate[furrr]] * `future_map` functions work exactly like .hi-slate[purrr's] `map()` * Run across as many cores as your system has * See progress with .hi-slate[progressr] * Annual cost: $0 ] --- # The Power of Parallel To see the benefit of running code in parallel, let's write a .hi-blue[purposefully slow function:] ```r slow_square <- function(x = 1){ Sys.sleep(1/2) # wait half a second return(x^2) } ``` --- # The Power of Parallel How long does it take to run this function?<sup>5</sup> * Use `tic()` and `toc()` from .hi-slate[tictoc] to calculate elapsed time ```r tic() square_serial <- map(1:24, slow_square) toc() ``` ``` ## 12.27 sec elapsed ``` .footnote[<sup>5</sup> `sapply()` and `map()` take nearly the exact same time. 
There are also several [type-specific versions](https://www.rdocumentation.org/packages/purrr/versions/0.2.5/topics/map) of `map` in case you want output to be a logical, integer, double, or character, etc.]

--

The function runs in .hi-medgrn[serial], so it takes approximately `\(1/2*24 = 12\)` seconds

* Using one core, runs for `\(x=1\)`, then when done moves on to `\(x=2, ..., 24\)`

---
# Parallelization

We can .hi-blue[speed this up].

Modern CPUs are made up of multiple .hi-pink[cores] (processing units) that can all be given tasks simultaneously, allowing us to run code in .hi-blue[parallel].

--

First, use `future::availableCores()` to determine how many cores you have:

```r
availableCores()
```

```
## system 
##     24
```

Your number of cores will likely differ

* Most laptops have at least 4-8 cores these days.
* Even recent Chromebooks have 6!

---
# furrr

.hi-slate[furrr] functions make it easy to .hi-blue[parallelize] in just a few steps.

1. Set a "plan" for how the code will be run in parallel
  * Number of cores to use, how to execute tasks
1. Use `future_` version of your preferred `map_` function
1. Close parallel plan

--

First, we will .hi-medgrn[set the plan] and tell R how to execute the parallel session:

```r
# Calculate a "safe" number of cores (allow for background processes)
n_cores = availableCores() - 2
# Set the "plan"
plan(strategy = "multisession", # run in parallel in separate background R sessions
     workers = n_cores          # use the desired number of cores
)
```

---
# furrr

.hi-slate[furrr] functions make it easy to .hi-blue[parallelize] in just a few steps.

1. Set a "plan" for how the code will be run in parallel
  * Number of cores to use, how to execute tasks
1. Use `future_` version of your preferred `map_` function
1. Close parallel plan

Next, let's repeat the previous analysis with `future_map()`.
```r
tic()
square_parallel <- future_map(1:24, slow_square)
toc()
```

```
## 8.09 sec elapsed
```

---
# furrr

.hi-slate[furrr] functions make it easy to .hi-blue[parallelize] in just a few steps.

1. Set a "plan" for how the code will be run in parallel
  * Number of cores to use, how to execute tasks
1. Use `future_` version of your preferred `map_` function
1. Close parallel plan

Now that we're done with our parallel session, reset things back to serial:

```r
plan("sequential")
```

---
# Benefits of Parallelization

Here we cut execution time by only about 1/3 (12.27 to 8.09 seconds): creating and assigning objects to the cores adds some overhead.

However, the benefits of parallel increase substantially with

* Larger objects
* Greater number of repetitions (must be independent tasks)
* More cores

--

For example, if we run our slow function over the integers 1 to 1,000:

| Approach | Time | Time Savings |
|----------|------|-------------------|
| Serial | 1,017.3 Seconds | 0% |
| Parallel, 5 cores | 204.33 Seconds | 80% |
| Parallel, 10 cores | 103.87 Seconds | 90% |
| Parallel, 20 cores | 51.49 Seconds | 95% |

---
# Progress with *progressr*

For longer tasks, it can be helpful to see progress. We can do this by using the functions within .hi-slate[progressr].

First, let's add a .hi-purple[progress indicator] to our function, passed in as the argument `p`.

```r
slow_square_prog <- function(x = 1, p){
  p() # signal one step of progress
  Sys.sleep(1/2) # wait half a second
  return(x^2)
}
```

---
# Progress with *progressr*

Next, write a .hi-medgrn[wrapper function] around our future map that creates the progressor and passes it along:

```r
par_slow_square <- function(x){
  p <- progressor(steps = length(x)) # one step per element of x
  future_map(x, slow_square_prog, p = p) # hand the progressor to each call
}
```

---
# Progress with *progressr*

Finally, wrap the function in `with_progress({})` to get a .hi-pink[visible progress bar].
```r with_progress({ par_slow_square(1:24) }) ``` --- # Tweaking Progress Bar There are a lot of [different progress bar options](https://cran.r-project.org/web/packages/progressr/vignettes/progressr-intro.html), including * Change the shape used in the ASCII progress bar ```r pacman::p_load(cli) handlers(handler_txtprogressbar(char = cli::col_red(cli::symbol$smiley))) with_progress({ par_slow_square(1:24) }) ``` --- # Tweaking Progress Bar There are a lot of [different progress bar options](https://cran.r-project.org/web/packages/progressr/vignettes/progressr-intro.html), including * Continuous color bar ```r handlers("cli") with_progress({ par_slow_square(1:24) }) ``` --- # Tweaking Progress Bar There are a lot of [different progress bar options](https://cran.r-project.org/web/packages/progressr/vignettes/progressr-intro.html), including * Audible beeps at start, intervals, and finish ```r pacman::p_load(beepr) handlers("cli", "beepr") with_progress({ par_slow_square(1:24) }) ``` --- # Tweaking Progress Bar We can customize the sounds more fully with `handler_beepr()`: ```r sound_path <- paste0(getwd(), "/images/finish.wav") handlers(list( "cli", handler_beepr( initiate = NA_integer_, # disable start sound update = NA_integer_, # disable progress sound finish = sound_path # set custom finish sound ) ) ) with_progress({ par_slow_square(1:10) }) ``` --- # Table of Contents 1. [Prologue](#prologue) 2. [If/Else Statements](#ifelse) 3. [For Loops](#for) 4. [Functions](#fun) 5. [Indirection and Name Injection](#indirect) 6. [Vectorization](#vec) 7. [Parallelization](#parallel)